
Faster Neural Networks Straight from JPEG

Neural Information Processing Systems

The simple, elegant approach of training convolutional neural networks (CNNs) directly from RGB pixels has enjoyed overwhelming empirical success. But can more performance be squeezed out of networks by using different input representations? In this paper we propose and explore a simple idea: train CNNs directly on the blockwise discrete cosine transform (DCT) coefficients computed and available in the middle of the JPEG codec. Intuitively, when processing JPEG images using CNNs, it seems unnecessary to decompress a blockwise frequency representation to an expanded pixel representation, shuffle it from CPU to GPU, and then process it with a CNN that will learn something similar to a transform back to frequency representation in its first layers. Why not skip both steps and feed the frequency domain into the network directly? In this paper we modify libjpeg to produce DCT coefficients directly, modify a ResNet-50 network to accommodate the differently sized and strided input, and evaluate performance on ImageNet. We find networks that are both faster and more accurate, as well as networks with about the same accuracy but 1.77x faster than ResNet-50.
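The blockwise DCT representation the abstract refers to can be illustrated with a short sketch. This is an illustrative reimplementation of the JPEG-style transform, not the authors' modified libjpeg: it splits one image channel into 8x8 blocks and applies the 2-D DCT-II to each, yielding an 8x-spatially-downsampled, 64-channel tensor of the kind a CNN could consume directly.

```python
import numpy as np
from scipy.fft import dctn

def blockwise_dct(channel, block=8):
    """Split a single image channel into 8x8 blocks and apply the 2-D DCT-II,
    mirroring the transform used inside the JPEG codec."""
    h, w = channel.shape
    h, w = h - h % block, w - w % block  # crop to a multiple of the block size
    blocks = channel[:h, :w].reshape(h // block, block, w // block, block)
    blocks = blocks.transpose(0, 2, 1, 3)             # (H/8, W/8, 8, 8)
    coeffs = dctn(blocks, axes=(2, 3), norm="ortho")  # DCT over each block
    # Flatten each block's 64 coefficients into channels: the result is a
    # (H/8, W/8, 64) "image", i.e. 8x smaller spatially but 64 channels deep.
    return coeffs.reshape(h // block, w // block, block * block)
```

Note the input layer implied by this representation: one step over the coefficient grid corresponds to 8 pixels in the original image, which is exactly the stride mismatch the network architecture then has to accommodate.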



Reviews: Faster Neural Networks Straight from JPEG

Neural Information Processing Systems

My main concerns were addressed in the rebuttal (separation between I/O, CPU, and GPU gains), along with TFLOP measurements and a discussion of related work. I was happy to see that their model (Late-Concat-RFA-Thinner) is still faster than ResNet-50 (approx. 680 vs. 475 in Figure 1, a roughly 43% gain). This is a pessimistic estimate, given that the RGB ResNet-50 also needs to apply the inverse DCT to go from DCT coefficients to the RGB domain. However, I was a bit surprised to see such a big disconnect between the timing numbers and the TFLOP measurements (Figure 1b vs. Figure 1c of the rebuttal). While I trust that the authors timed the models fairly and thus do not doubt the results, I think this would be worth further investigation. As for the related work, the authors discussed it well in the rebuttal, but I find it strange that we had to ask for this.



Faster Neural Networks Straight from JPEG

#artificialintelligence

We were initially a little disappointed that the error of the UpSampling model was higher than that of the baseline ResNet-50. We hypothesized that the source of this problem was a subtle issue: units in early layers of the DownSampling and UpSampling models have receptive fields that are too large. The receptive field of a unit in a CNN is the number of input pixels it can see, i.e. the number of input pixels that can possibly influence its activation. Indeed, after examining the strides and receptive fields of each layer in the network, we found that halfway through a vanilla ResNet-50, units have receptive fields of about 70 pixels. Just as far through our naively assembled UpSampling model, the receptive fields are already 110 pixels, larger because our DCT input layer has a [stride, receptive field] of [8, 8] instead of the [1, 1] of a typical pixel input layer.
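The receptive-field bookkeeping described above follows a standard recursion, which a minimal sketch can make concrete (the layer configurations below are illustrative, not the paper's exact architectures): stacking a layer with kernel k and stride s grows the receptive field by (k - 1) times the cumulative stride ("jump") accumulated so far.

```python
def receptive_field(layers):
    """Track the receptive field and cumulative stride ("jump") of a unit
    as layers are stacked. For each layer with kernel k and stride s:
        rf += (k - 1) * jump;  jump *= s.
    `layers` is a list of (kernel, stride) pairs, input-first."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf, jump

# An 8x8 blockwise DCT front end acts like a (kernel=8, stride=8) first
# layer, so the same downstream convolutions see a far larger input patch
# than they would behind a (kernel=1, stride=1) pixel front end.
dct_rf, _ = receptive_field([(8, 8), (3, 1), (3, 1)])
pixel_rf, _ = receptive_field([(1, 1), (3, 1), (3, 1)])
```

With two 3x3 convolutions stacked on each front end, the DCT path already covers a 40-pixel patch where the pixel path covers 5, which is the mismatch that motivated reworking the early strides.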


Faster Neural Networks Straight from JPEG

Gueguen, Lionel, Sergeev, Alex, Kadlec, Ben, Liu, Rosanne, Yosinski, Jason

Neural Information Processing Systems


